In Brief

The mosaic package provides facilities for putting model fitting into a “mathematical function” framework.

Use with lm( ) and glm( )

Example:

Mortality at 20 years of follow-up for smokers and non-smokers.

sample( Whickham, size=5 )
##      outcome smoker age orig.id
## 665    Alive     No  22     665
## 781    Alive     No  32     781
## 1062    Dead     No  67    1062
## 956     Dead     No  72     956
## 200    Alive     No  51     200
smod <- glm( outcome=='Alive' ~ age*smoker, family='binomial',
             data=Whickham)
f <- makeFun(smod) # EXTRACT THE FUNCTION
## Warning in inferTransformation(formula(object)): You may need to specify
## transformation to get the desired results.

Evaluation … For a 60-year-old …

f( age=60, smoker=c('Yes'))
##        1 
## 0.512897

Plot data and the functional form

xyplot( outcome=='Alive' ~ age | smoker, data=Whickham,
        pch=20, alpha=.2 )

plotFun( f(age=age, smoker="No") ~ age, add=TRUE)

plotFun( f(age=age, smoker="Yes") ~ age, add=TRUE, 
         col='red', lwd=2)

lm()

hmod <- lm( height ~ father, data=subset(Galton,sex=="M"))

Extract the function … just like resid( ) or fitted( )

hfun <- makeFun(hmod)

Plot it … first data, then the function.

xyplot(height ~ father | sex, data=Galton)

plotFun( hfun(x) ~ x, add=TRUE)

Multiple variables:

mod2 <- lm( height ~ father + sex, data=Galton)
hfun2 <- makeFun(mod2)
xyplot(height ~ father | sex, data=Galton)

plotFun( hfun2(x, sex="F") ~ x, add=TRUE)

plotFun( hfun2(x, sex="M") ~ x, add=TRUE, col='red')

Or …

mod3 <- lm( height ~ father*mother + sex, data=Galton)
hfun3 <- makeFun(mod3)
plotFun( hfun3(father=x, mother=y, sex="F") ~ x&y,
         x.lim=c(60,80),y.lim=c(57,73)) 

plotPoints( mother ~ father, data=Galton, add=TRUE,
            pch=20,col='red')

Confidence Intervals

hfun2( father=66, sex=c('M','F'), interval='confidence')
##        fit      lwr      upr
## 1 67.87340 67.59130 68.15550
## 2 62.69736 62.40424 62.99049
hfun2( father=66, sex=c('M','F'), interval='prediction')
##        fit      lwr      upr
## 1 67.87340 63.39609 72.35072
## 2 62.69736 58.21934 67.17539

NOTE: ggplot2 does this well.

ggplot(data=Galton, aes(x=father, y=height)) + geom_point()  + 
  aes(colour=sex)  + stat_smooth(method=lm) + 
  theme(legend.position="none") + labs(title="") 

Convenient interface via mosaic::mPlot().


Not-so-brief introduction

One of the objectives of Project MOSAIC is to make connections between statistics and mathematics and computation. The mosaic package offers some features specifically oriented to doing math.

These are suggestions or demonstrations appropriate in teaching statistical modeling to students not afraid of math, e.g. engineers, statistics minors, …

f <- makeFun( 3*x + 2 ~ x)

Read this as, “make a function f that takes \(x\) as an input and produces \(3x + 2\) as the output.”

Evaluating the function is a matter of providing an input:

f(4)
## [1] 14

It’s easy to plot functions:

plotFun( 3*x + 2 ~ x, x.lim=range(-5,5) )

Or, if we already have the function f, give \(x\) as an input:

plotFun( f(x) ~ x, x.lim=range(-5,5) )

Functions of multiple variables follow the same scheme:

g <- makeFun( 3*x*y + 2*x -4*y +2 ~ x&y)
g(x=0,y=2)
## [1] -6
plotFun( g(x=x,y=y)~x&y, 
         x.lim=range(-2,2),y.lim=range(-2,2) )

Or even cross-sections, or parametric curves …

plotFun( g(x=x, y=0.5) ~ x, x.lim=range(-2,2) )

plotFun( g(x=3*cos(t),y=sin(t))~ t, t.lim=range(-5,5))

Why is this relevant to statistical modeling?

Statistical Conventions

The workhorses of model building in R are lm() and glm(). They are great, but unfortunately they return values that are not what students are used to.

In particular, students are used to lines expressed as \(y = m x + b\), etc.

lm() gives back something different, e.g.

lm( height ~ father, data=Galton )
## (Intercept)      father 
##  39.1103868   0.3993813

How different this is from \(y = mx + b\)!
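One way to bridge the gap is to rebuild the familiar form by hand from the fitted coefficients (a sketch; the names slope, b, and y are ours, not part of the package):

```r
library(mosaic)   # provides the Galton data (via mosaicData)
m <- lm( height ~ father, data=Galton )
b <- coef(m)[1]       # intercept: about 39.11
slope <- coef(m)[2]   # slope: about 0.40
y <- function(x) slope * x + b   # the familiar y = m*x + b, by hand
y(65)                 # predicted height for a 65-inch father: about 65.07
```

This manual route works, but it gets tedious with more variables, which is where makeFun() earns its keep.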

And then, what happens when you plot a model …

m <- lm( height ~ father, data=Galton )
plot(m)

Where lm( ) shines

The lm() function was written by statisticians for statisticians. Confidence intervals and p-values are easy:

coef(summary(m))
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 39.1103868 3.22706259 12.119501 2.055761e-31
## father       0.3993813 0.04658212  8.573704 4.354588e-17

lm() is particularly valuable when there are multiple explanatory variables.

m2 <- lm( height ~ sex + father + mother, data=Galton)
coef(summary(m2))
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 15.3447600 2.74696121  5.586085  3.082284e-08
## sexM         5.2259513 0.14400791 36.289335 5.786616e-178
## father       0.4059780 0.02920696 13.900044  6.525648e-40
## mother       0.3214951 0.03128178 10.277392  1.701771e-23

Overcoming the Differences

One strategy is to start with what students already know, and build on that.

  1. Remind students that it’s really \(y(x) = mx + b\): \(y\) is a function of \(x\). There are two variables involved.
  2. Point out that \(x\) and \(y\) are simply names for variables.
    • We could write, e.g. \(z(t) = mt + b\).
    • One more step takes us to mnemonic variable names, the height of a child taken as a function of the father’s height (as Galton put it). height(father) = m * father + b
  3. Then introduce function fitting in the context of the function height(father):
    • There’s some data: Galton in this case.
    • Variable names are set by the data:

      names(Galton)
      ## [1] "family" "father" "mother" "sex"    "height" "nkids"
    • We don’t expect the child’s height to match \(m \times \mathrm{father} + b\) exactly, only approximately. So height(father) ~ m * father + b
  4. And do the fitting:

    f <- fitModel( height ~ m*father + b, data=Galton)

You plot this the same way as “mathematical” functions:

plotFun( f(father=x)~x, x.lim=range(60,80) )

If you want to talk about residuals, least squares, etc., you can. For instance, have students guess the function before fitting it …

xyplot( height ~ father, data=Galton)

myGuess <- makeFun(40 + .35*x ~ x )
plotFun( myGuess(x) ~ x, add=TRUE, col='red')

plotFun( f(father=x) ~ x, add=TRUE, col='green')

Which is better?

mean( (myGuess(father)-height)^2, data=Galton )
## [1] 18.26244
mean( (f(father) - height)^2, data=Galton)
## [1] 11.85077

Students can explore these questions:

f0 <- fitModel(height ~ a + 0*father, data=Galton)
f1 <- fitModel(height ~ a + b*father, data=Galton)
f2 <- fitModel(height ~ a + b*father + c*father^2, data=Galton)

xyplot( height ~ father, data=Galton)

plotFun( f0(father=x)~x, add=TRUE)

plotFun( f1(father=x)~x, add=TRUE, col='red')

plotFun( f2(father=x)~x, add=TRUE, col='green')

Which function is better? (Hint: it doesn’t depend on the data!)

mean( ~ I((f0(father)-height)^2), data=Galton )
## [1] 12.82301
mean( ~ I((f1(father)-height)^2), data=Galton )
## [1] 11.85077
mean( ~ I((f2(father)-height)^2), data=Galton )
## [1] 11.82975

lm( ) makes a function.

It just isn’t packaged as one. mosaic::makeFun() does the packaging:

m1 <- lm( height ~ father, data=Galton)
g1 <- makeFun(m1)
g1( father=65 )
##        1 
## 65.07017

Functions with more than 1 input

While we’re at it …

Let’s do what Galton never could: Include women!

And we can take a liberal view about how the child came to be: the father and the mother multiplying:

mod <- lm( height ~ sex + father*mother, data=Galton )
f <- makeFun( mod )
f( sex='F', father=65, mother=65 )
##        1 
## 62.59029
plotFun( f(sex='M', father=x, mother=y) ~ x & y, 
         x.lim=range(60,80), y.lim=range(60,80))

  • What does the function look like without the interaction?
  • What if you include the child’s sex as an interaction?
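For the first question, a sketch of the interaction-free surface (the names mod_add and f_add are ours):

```r
library(mosaic)   # provides Galton and makeFun()/plotFun()
# Additive model: no father:mother interaction term
mod_add <- lm( height ~ sex + father + mother, data=Galton )
f_add <- makeFun( mod_add )
plotFun( f_add(sex='M', father=x, mother=y) ~ x & y,
         x.lim=range(60,80), y.lim=range(60,80) )
# Without the interaction the contours are parallel straight lines;
# with it, the slope in father depends on mother.
```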

Functions more generally

g <- smoother( height ~ father, data=Galton, degree=1 )
xyplot(height ~ father, data=Galton) 

plotFun( g(father=x)~x, add=TRUE, lwd=3 )

plotFun( f1(father=x)~x, add=TRUE, col='red', 
         lwd=3, alpha=.5)

Where the data are, the two functions are basically the same.

Modeling probabilities

Example: Mortality at 20 years of follow-up for smokers and non-smokers.

sample( Whickham, size=5 )
##      outcome smoker age orig.id
## 665    Alive     No  22     665
## 781    Alive     No  32     781
## 1062    Dead     No  67    1062
## 956     Dead     No  72     956
## 200    Alive     No  51     200

A \(\chi^2\)-test? Really?

counts <- tally( ~ outcome & smoker, data=Whickham )
chisq.test( counts )
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  counts
## X-squared = 8.7515, df = 1, p-value = 0.003093

And what’s the effect size? In what direction?

Conditional Probabilities

tally( outcome ~ smoker, data=Whickham )
##        smoker
## outcome  No Yes
##   Alive 502 443
##   Dead  230 139

Really? Smokers had a larger chance of surviving?
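Part of the answer is a lurking variable: the smokers in these data tend to be younger. A quick check, using mosaic’s formula interface to mean():

```r
library(mosaic)   # provides the Whickham data (via mosaicData)
# Group means of age by smoking status: the smokers skew younger,
# and younger women were more likely to be alive at follow-up.
mean( age ~ smoker, data=Whickham )
```

This is exactly the kind of confounding the next section handles by including age as a covariate.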

Models can include covariates

Just as there are smoothers, there are function types suited to Dead/Alive outcomes … e.g. logistic regression.

smod <- glm( outcome=='Alive' ~ age*smoker, family='binomial',
             data=Whickham)
f <- makeFun(smod)
## Warning in inferTransformation(formula(object)): You may need to specify
## transformation to get the desired results.

For a 60-year-old …

f( age=60, smoker=c('Yes','No'))
##         1         2 
## 0.5128970 0.5437244

A function is a function, whatever the form!

xyplot( outcome=='Alive' ~ age | smoker, data=Whickham,
        pch=20, alpha=.2 )

plotFun( f(age=age, smoker="No") ~ age, add=TRUE)

plotFun( f(age=age, smoker="Yes") ~ age, add=TRUE, 
         col='red', lwd=2)

Inference

actual <- lm(height ~ father+sex, data=Galton)
f <- makeFun( actual )
plotFun(f(x, sex='F')~x, x.lim=range(65,80))
# Confidence interval?
for (k in 1:50) {
  trial <- lm(height ~ father+sex, data=resample(Galton) )
  f <- makeFun( trial )
  print(plotFun(f(x, sex='F')~x, add=TRUE, col='blue', alpha=.5))
}
# Null Hypothesis?
for (k in 1:10) {
  trial <- lm(height ~ father+shuffle(sex), data=Galton )
  f <- makeFun( trial )
  print(plotFun(f(x, sex='F')~x, add=TRUE, col='red', alpha=.5))
}